In this section of the data wrangling topic you will
Learn about the concept of tidy data
Learn how to convert data between wide and long formats using the tidyr 📦
Tidying data with tidyr
Tidy data
Tidy datasets provide a standardized way to link the structure of a dataset (its physical layout) with its semantics (its meaning). – Hadley Wickham
A way to arrange data that facilitates subsequent exploration, transformation, visualization, & modelling of the data where
each variable is a column; each column is a variable.
each observation is a row; each row is an observation.
each value is a cell; each cell is a single value.
Tidy data
How to represent data
Look at three different ways we could store data on Bovine tuberculosis (TB) incidents in cattle herds in Great Britain:
# A tibble: 1,017 × 4
country date n_herds n_cases
<chr> <date> <dbl> <dbl>
1 England 1996-01-31 91024 355
2 England 1996-02-29 90735 392
3 England 1996-03-31 90512 440
4 England 1996-04-30 90210 454
5 England 1996-05-31 89357 434
6 England 1996-06-30 88554 403
7 England 1996-07-31 87802 387
8 England 1996-08-31 87356 379
9 England 1996-09-30 86879 313
10 England 1996-10-31 86065 274
# ℹ 1,007 more rows
# A tibble: 2,034 × 4
country date type count
<chr> <date> <chr> <dbl>
1 England 1996-01-31 n_herds 91024
2 England 1996-01-31 n_cases 355
3 England 1996-02-29 n_herds 90735
4 England 1996-02-29 n_cases 392
5 England 1996-03-31 n_herds 90512
6 England 1996-03-31 n_cases 440
7 England 1996-04-30 n_herds 90210
8 England 1996-04-30 n_cases 454
9 England 1996-05-31 n_herds 89357
10 England 1996-05-31 n_cases 434
# ℹ 2,024 more rows
# A tibble: 1,017 × 3
country date rate
<chr> <date> <chr>
1 England 1996-01-31 355/91024
2 England 1996-02-29 392/90735
3 England 1996-03-31 440/90512
4 England 1996-04-30 454/90210
5 England 1996-05-31 434/89357
6 England 1996-06-30 403/88554
7 England 1996-07-31 387/87802
8 England 1996-08-31 379/87356
9 England 1996-09-30 313/86879
10 England 1996-10-31 274/86065
# ℹ 1,007 more rows
Which is most useful?
How to represent data
The first version is most useful
bovine
# A tibble: 1,017 × 4
country date n_herds n_cases
<chr> <date> <dbl> <dbl>
1 England 1996-01-31 91024 355
2 England 1996-02-29 90735 392
3 England 1996-03-31 90512 440
4 England 1996-04-30 90210 454
5 England 1996-05-31 89357 434
6 England 1996-06-30 88554 403
7 England 1996-07-31 87802 387
8 England 1996-08-31 87356 379
9 England 1996-09-30 86879 313
10 England 1996-10-31 86065 274
# ℹ 1,007 more rows
We have direct access to the variables of interest
# A tibble: 87 × 3
# Groups: country [3]
country year total_cases
<chr> <chr> <dbl>
1 England 1996 4461
2 England 1997 4921
3 England 1998 6790
4 England 1999 7668
5 England 2000 8525
6 England 2001 8318
7 England 2002 17443
8 England 2003 20466
9 England 2004 19304
10 England 2005 22351
# ℹ 77 more rows
Wide or long?
Data can be in what we call wide format — where multiple observations are in a single row, or
Data can be in long format — where where each row corresponds to a single observation
We can usually make tidy data wider or longer to suit our needs
# A tibble: 87 × 14
# Groups: country, year [87]
country year Jan Feb Mar Apr May Jun Jul Aug Sep Oct
<chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 England 1996 355 392 440 454 434 403 387 379 313 274
2 England 1997 339 367 392 443 451 428 427 398 413 401
3 England 1998 497 539 608 610 622 627 586 569 525 526
4 England 1999 564 650 710 741 765 734 685 634 562 509
5 England 2000 632 720 779 783 766 734 706 674 653 665
6 England 2001 768 780 763 766 738 707 680 636 620 606
7 England 2002 723 960 1182 1423 1561 1612 1647 1637 1630 1619
8 England 2003 1808 1849 1991 1997 1915 1820 1707 1578 1501 1424
9 England 2004 1497 1553 1700 1798 1796 1773 1688 1584 1471 1427
10 England 2005 1699 1913 2059 2098 2039 1946 1842 1753 1709 1719
# ℹ 77 more rows
# ℹ 2 more variables: Nov <dbl>, Dec <dbl>
You’ll often see data entered / stored this way
Long format
Data stored in databases are often stored in long (normalised forms)
# A tibble: 2,034 × 4
country date type count
<chr> <date> <chr> <dbl>
1 England 1996-01-31 n_herds 91024
2 England 1996-01-31 n_cases 355
3 England 1996-02-29 n_herds 90735
4 England 1996-02-29 n_cases 392
5 England 1996-03-31 n_herds 90512
6 England 1996-03-31 n_cases 440
7 England 1996-04-30 n_herds 90210
8 England 1996-04-30 n_cases 454
9 England 1996-05-31 n_herds 89357
10 England 1996-05-31 n_cases 434
# ℹ 2,024 more rows
Pivoting
Converting between a wide and long representations is called pivoting
gather() & spread()
An earlier implementation in tidyr 📦 used
gather()
spread()
roughly corresponding to
pivot_longer()
pivot_wider()
Many examples online and in 1st edition of r4ds used gather() and spread() but the pivot_*() functions are to be preferred now
pivot_longer()
How does pivoting data to a longer format work?
Say we have a data set with three dogs (A, B, and C) and we have taken their blood pressure twice each
# A tibble: 6 × 3
id measurement value
<chr> <chr> <dbl>
1 A bp1 100
2 A bp2 120
3 B bp1 140
4 B bp2 115
5 C bp1 120
6 C bp2 125
pivot_longer()
The id column is already a variable
It needs to be repeated once per column that we are pivoting
Here we are pivoting two columns, so we repeat each idtwice
pivot_longer()
The column names from dogs:
bp1
bp2
become the values in a new column, whose name was give by the names_to argument: "measurement"
The original column names need to be repeated once per row of the original data
pivot_longer()
The cell values in the original data also become a new variable
The name of this new variable is given by the values_to argument: "value"
They are unwound, row by row
pivot_longer()
dogs |>pivot_longer(cols = bp1:bp2, # which columns are we pivoting?names_to ="measurement", # variable to create in long formvalues_to ="value"# variable to hold the cell values )
# A tibble: 6 × 3
id measurement value
<chr> <chr> <dbl>
1 A bp1 100
2 A bp2 120
3 B bp1 140
4 B bp2 115
5 C bp1 120
6 C bp2 125
More complicated settings
Sometimes you will have data like this, with multiple variables per column
We have variables named h_1996_Jan and c_1996_Feb, where
h indicates if the data are number of herds
c indicates if the data are number of cases
pivot_longer()
bovine_long <- bovine_wide |>pivot_longer(cols =!country, # which columns are we pivoting?names_to =c("type", "year", "month"), # variables to create in long formnames_sep ="_", # split names at _values_to ="n"# variable to hold the cell values )bovine_long
# A tibble: 2,034 × 5
country type year month n
<chr> <chr> <chr> <chr> <dbl>
1 England h 1996 Jan 91024
2 England c 1996 Jan 355
3 England h 1996 Feb 90735
4 England c 1996 Feb 392
5 England h 1996 Mar 90512
6 England c 1996 Mar 440
7 England h 1996 Apr 90210
8 England c 1996 Apr 454
9 England h 1996 May 89357
10 England c 1996 May 434
# ℹ 2,024 more rows
pivot_longer()
Conceptually this is not too different from what we already saw
Instead of the columns names pivoting into a single column, they pivot to multiple columns
pivot_wider()
Having now separated the data into a long format we should see that there are multiple variables in the type column
bovine_long |>distinct(type)
# A tibble: 2 × 1
type
<chr>
1 h
2 c
The observations are at the level of country, by year and month
But the data for each observation are spread over two rows
pivot_wider()
We can use pivot_wider() to pivot the data to a wider format to create n_herds and n_cases columns
bovine_long |>pivot_wider(id_cols =c(country, year, month), # which columns are we *not* pivotingnames_from = type, # column with names of new variablesvalues_from = n # column with values for new variables ) |>rename(n_herds = h, n_cases = c) # rename for nicer names
# A tibble: 1,017 × 5
country year month n_herds n_cases
<chr> <chr> <chr> <dbl> <dbl>
1 England 1996 Jan 91024 355
2 England 1996 Feb 90735 392
3 England 1996 Mar 90512 440
4 England 1996 Apr 90210 454
5 England 1996 May 89357 434
6 England 1996 Jun 88554 403
7 England 1996 Jul 87802 387
8 England 1996 Aug 87356 379
9 England 1996 Sep 86879 313
10 England 1996 Oct 86065 274
# ℹ 1,007 more rows
pivot_wider()
To see how pivot_wider() works, we’ll revisit the small dog blood pressure data set, but this time we have only two dogs
We want to create a data frame with column names taken from "measurement" and fill the cells with the values from "value"
pivot_wider()
We want to create a data frame with columns names taken from "measurement" and fill the cells with the values from "value"
dogs2 |>pivot_wider(names_from = measurement, # which column(s) to take the names of new columns fromvalues_from = value # which column to take the values for the cells from )
# A tibble: 2 × 4
id bp1 bp2 bp3
<chr> <dbl> <dbl> <dbl>
1 A 100 120 105
2 B 140 115 NA
How pivot_wider() works — I
First, pivot_wider() has to identify what the new columns will be
dogs2 |>distinct(measurement) |>pull()
[1] "bp1" "bp2" "bp3"
How pivot_wider() works — II
By default, the rows in the output are determined from those variables not going to new variables (names) or values
These are the id_cols and should uniquely identify a row in the new (wider) data frame
Here we only have 1 ID column
dogs2 |>select(-measurement, -value) |>distinct()
# A tibble: 2 × 1
id
<chr>
1 A
2 B
How pivot_wider() works — III
pivot_wider() then uses these combinations to create an empty data frame
# A tibble: 2 × 4
id bp1 bp2 bp3
<chr> <lgl> <lgl> <lgl>
1 A NA NA NA
2 B NA NA NA
Where bp1 etc were the distinct values of measurement
How pivot_wider() works — IV
pivot_wider() then fills in the new data frame with the relevant values
Not every dog had three BP measurements, so there is no value to fill in the data frame with
In that case the value is NA (missing)
dogs2 |>pivot_wider(names_from = measurement, # which column(s) to take the names of new columns fromvalues_from = value # which column to take the values for the cells from )
# A tibble: 2 × 4
id bp1 bp2 bp3
<chr> <dbl> <dbl> <dbl>
1 A 100 120 105
2 B 140 115 NA